January 10th 2025

The SchoolDataIT R package

  • SchoolDataIT is an R package to Retrieve, Harmonise and Map Open Data regarding the Italian Education System

  • Compiles and displays the available data sets on the Italian school system, with a focus on infrastructural features

  • Scope of this line of research:

    • Encourage Multidimensional analysis of the state of the school system
    • Take into account spatial information in this analysis

Context: the AMELIA platform

  • The present R package is meant as a contribution to the AMELIA data platform, managed by the GRINS foundation

  • Data platform for the transfer of knowledge and statistical analysis

    • Broad repository hosting several categories of administrative data from different sources
    • Available to private, corporate and academic users
    • Data organised at the territorial level of municipalities (LAU)

The importance of school infrastructure for learning processes

  • Drawing increasing interest in the scientific literature

  • Foundational work in the international context: systematic review by the World Bank (2019)

    • Comparison across several countries, both OECD and developing countries
    • Material factors act as a symbolic “third teacher”

Barrett, P., Treves, A., Shmis, T., Ambasz, D. (2019). The impact of school infrastructure on learning: A synthesis of the evidence. International Development in Focus. Washington, DC: World Bank.

  • Foundational work in the Italian context: working paper by the Bank of Italy (2023)
    • Carries out a multidimensional analysis
    • Detailed comparison of infrastructure endowment at the regional level → Northern regions are typically advantaged along many dimensions
    • Managing territorial gaps is a complex task for policymakers due to the allocation of competences between State and local authorities

    Bucci, M., Gazzano, L., Gennari, E., Grompone, A., Ivaldi, G., Messina, G., Ziglio, G. (2023). Per chi suona la campan(ell)a? La dotazione di infrastrutture scolastiche in Italia (For whom the bell tolls? The availability of school infrastructure in Italy). Politica economica, 1-50.

  • School infrastructure influences learning processes through several dimensions:

    • School accessibility: availability of transport, school size
    • Indoor safety: enhances learning continuity
    • Outdoor safety: protection from hazards such as pollution or earthquakes
    • Availability of learning spaces (IT rooms, laboratories)
    • Availability of recreational spaces (e.g. canteens allow for full-time schooling)

The SchoolDataIT R package

  • The SchoolDataIT R package has been developed in the spirit of the aforementioned line of research

  • Data gathered from different open institutional sources

  • Aim: detecting areas of educational vulnerability → planning of appropriate development policies

  • The package relies on the tidyverse environment → output data sets are easy to export

  • Version 0.2.2 is on CRAN (version 0.2.3 currently compiling)

  • Constantly maintained on GitHub (version 0.2.4, to be submitted to CRAN)

Main function modules

  • Get_: input scraping. Information is not altered, and the user receives a data set as close as possible to the one released by the provider

  • Util_: auxiliary functions; mainly data manipulation or quality checks

  • Group_: data aggregation at the relevant territorial level

    • NUTS-3/Province
    • LAU/Municipality
  • Map_: display of the data on maps

    • Static maps (vector format): easy to export
    • Interactive maps: preserve more information
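The aggregation step performed by the Group_ module can be illustrated with a minimal, language-agnostic sketch. It is written in Python rather than R and does not reproduce the package's functions: the municipality codes, the counts, and the helper `average_class_size_by_municipality` are all hypothetical, and the rule "total students divided by total classes" is an assumption about how a municipality-level class size could be derived.

```python
from collections import defaultdict

# Hypothetical school-level records: (municipality code, students, classes).
# Codes and counts are made up for illustration only.
schools = [
    ("M001", 120, 6),
    ("M001", 66, 3),
    ("M002", 45, 3),
]


def average_class_size_by_municipality(records):
    """Return {municipality: total students / total classes}."""
    students = defaultdict(int)
    classes = defaultdict(int)
    for municipality, n_students, n_classes in records:
        students[municipality] += n_students
        classes[municipality] += n_classes
    # Municipalities without any class are skipped to avoid dividing by zero.
    return {m: students[m] / classes[m] for m in students if classes[m] > 0}


class_size = average_class_size_by_municipality(schools)
print(class_size)
```

Pooling students and classes before dividing weights each school by its number of classes, which differs from a plain mean of school-level averages.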

Main datasets

School buildings database

  • Main source of information on school infrastructure

  • Regards several dimensions, such as:

    • Environmental context, environmental/administrative restrictions, acoustic insulation
    • Accessibility through private or public transport
    • School area surface, building volume, age/period
    • Intended use of learning and recreational spaces
    • Heating systems and energy-saving devices
    • Overcoming architectural barriers
    • Seismic risk and seismic design

  • Example: synthetic indicator of public transport availability ← Sensitivity-based weights\(^1\) (using R package CompindexR)

\(^1\) 0.254 × urban transport + 0.158 × interurban transport + 0.328 × railway transport + 0.260 × school buses
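The weighted sum in the footnote can be sketched as follows. Only the four weights come from the slide. The sketch assumes each component is expressed as a share between 0 and 1, and the function name, dictionary keys and example municipality are hypothetical.

```python
# Weights reported in the footnote; they sum to one.
WEIGHTS = {
    "urban": 0.254,
    "interurban": 0.158,
    "railway": 0.328,
    "school_bus": 0.260,
}


def transport_indicator(shares):
    """Weighted sum of the four transport components, each assumed in [0, 1]."""
    return sum(WEIGHTS[key] * shares[key] for key in WEIGHTS)


# Hypothetical municipality: share of schools reached by each transport mode.
example = {"urban": 1.0, "interurban": 0.5, "railway": 0.0, "school_bus": 1.0}
score = transport_indicator(example)
print(round(score, 3))
```

Because the weights sum to one, the indicator stays in [0, 1] whenever every component does.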

  • Example: proportion of primary schools equipped with a canteen

Student counts and class size

  • Proportion of primary schools allowing for full-time schooling

Usage example: classroom size and Invalsi scores

  • Suppose we are interested in studying the link between class size and student outcomes in mathematics

    • Typically a negative association in the literature
  • Scope: last year of middle school, school year 2022/23

  • Student outcomes measured by Invalsi scores (spatially homogeneous indicator, high-quality data)

    • Mean = \(200\) points; between-student s.d. = \(40\) points
    • Available for \(\approx 2000\) municipalities (data downloaded in December 2024)

Usage example: OLS regression

  • Simplest analysis: OLS regression of municipality-level Invalsi scores on municipality-level average classroom size

\[ y = \beta_0 + \beta_\mathrm{nstud} X_\mathrm{nstud} + \varepsilon \]

\(y\): Invalsi score; \(X_\mathrm{nstud}\): classroom size; \(\beta\): regression coefficients; \(\varepsilon\): Gaussian error

→ Puzzling result: positive association (\(3.236\) Invalsi points) → Does classroom size incorporate additional information we are overlooking?

  • Example: demographic dynamics, with classrooms more crowded in infrastructural poles than in peripheral areas
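How an omitted variable can flip the sign of an OLS slope can be shown with a small simulation. The data are entirely synthetic and every number in the generating process is an assumption chosen for illustration; none of it comes from the Invalsi data or reproduces the estimate above. Central municipalities are given both larger classes and higher scores, while the true class-size effect is negative.

```python
import random

random.seed(42)

n = 2000
size, central, score = [], [], []
for _ in range(n):
    c = 1 if random.random() < 0.5 else 0            # 1 = pole, 0 = periphery
    s = 15 + 6 * c + random.gauss(0, 1.5)            # poles have larger classes
    y = 200 - 0.5 * s + 10 * c + random.gauss(0, 5)  # true class-size effect: -0.5
    central.append(c)
    size.append(s)
    score.append(y)


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


# OLS slope of score on class size alone (centrality omitted).
naive_slope = cov(size, score) / cov(size, size)

# OLS slope of class size when the centrality dummy is also a regressor
# (closed-form solution of the normal equations for two regressors).
den = cov(size, size) * cov(central, central) - cov(size, central) ** 2
adjusted_slope = (
    cov(size, score) * cov(central, central)
    - cov(central, score) * cov(size, central)
) / den

print(naive_slope, adjusted_slope)
```

In this synthetic setting the naive slope is positive, while the slope that controls for centrality is negative and close to the value used to generate the data.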

Distribution of classroom size

  • Distribution across the \(6\) classes of the inner areas taxonomy (A, B: pole; C: belt; D: intermediate; E: peripheral; F: ultra-peripheral)

Usage example: spatial model

  • Take into account additional information
    • Spatial location of municipalities
    • Inner areas taxonomy (central/peripheral)
  • Model: \[ y = \beta_0 + X_\mathrm{nstud} \beta_\mathrm{nstud} + X_{I}\beta_{I} + \beta_{\ell} \ell +\beta_{\phi} \phi +u + \varepsilon \]
    • \(X_{\mathrm{nstud}}\): average classroom size; \(X_I\): dummy for the inner areas taxonomy (1: inner area, 0: central area); \(\ell\): longitude; \(\phi\): latitude; \(u\): latent spatially structured Gaussian process

  • \(u\) defined over a continuous spatial domain with prior distribution: \(u \sim N(0, \Sigma)\) → \(\Sigma\): Matérn covariance matrix
see: Banerjee, S., Carlin, B.P., Gelfand, A.E. (2014). Hierarchical Modeling and Analysis for Spatial Data, 2nd edition, Chapman and Hall/CRC
  • \(u\) is approximated with a Gaussian Markov Random Field defined on a 2-d manifold
Lindgren, F., Rue, H., Lindström, J. (2011). An Explicit Link between Gaussian Fields and Gaussian Markov Random Fields: The Stochastic Partial Differential Equation Approach, JRSS B, 73(4), pp 423–498
  • Model estimated using the Integrated Nested Laplace Approximation (INLA)
Rue, H., Martino, S., & Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. JRSS B, 71(2), 319-392.

  • Full model specification
    • \(\beta\) terms have flat priors
    • \(\varepsilon \sim N(0, \sigma_{\varepsilon}^2)\), where \(\sigma_{\varepsilon}^{-2} \sim \mathrm{Gamma}(1, 5 \cdot 10^{-5})\)
    • \(\Sigma\) has variance and range parameters \(\sigma^2\) and \(r\), both of which are assigned a penalised complexity prior, where \(r_0 = 350\,\mathrm{km}\) and \(\sigma_0 = 2.5\): \[ \left\{ \begin{array}{l} \mathrm{Prob}\left( r < r_0 \right) = 0.95 \\ \mathrm{Prob}\left( \sigma > \sigma_0 \right) = 0.95 \end{array} \right. \]
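The two probability statements determine the prior rate parameters. The sketch below assumes the standard penalised complexity construction for Matérn fields in two dimensions, which the slides do not spell out: an exponential prior on \(\sigma\), and a prior on \(r\) whose distribution function is \(\exp(-\lambda_r / r)\). The variable and function names are illustrative.

```python
import math

r0, sigma0, p = 350.0, 2.5, 0.95  # values taken from the slide

# Range (2-d case, assumed form): Prob(r < r0) = exp(-lam_r / r0) = p
lam_r = -math.log(p) * r0
# Marginal s.d. (assumed exponential): Prob(sigma > sigma0) = exp(-lam_s * sigma0) = p
lam_s = -math.log(p) / sigma0


def prob_range_below(r):
    """Prior probability that the range is smaller than r (in km)."""
    return math.exp(-lam_r / r)


def prob_sigma_above(s):
    """Prior probability that the marginal s.d. exceeds s."""
    return math.exp(-lam_s * s)


# Prior medians implied by the two rates.
prior_median_range = lam_r / math.log(2)
prior_median_sigma = math.log(2) / lam_s
print(prior_median_range, prior_median_sigma)
```

Under these assumptions the prior medians are about 26 km for the range and about 34 for \(\sigma\).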

Spatial regression model results

  • Classroom size no longer has a significant effect
  • Significant disadvantage of inner areas
  • Strong North-South trend

Estimated linear trend (left) and latent spatial field (right)

Interpreting the spatial effect

  • North-South and pole/periphery differences already taken into account

  • Hyperparameters → median \(\sigma^2\) (marginal variance of \(u\)) about \(31\) squared points (s.d. \(\approx 5.6\) points); median range about \(84.7\) km
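What a range of about 84.7 km means in practice can be seen by evaluating a Matérn correlation function at a few distances. The slides do not report the smoothness parameter, so this sketch assumes \(\nu = 3/2\), which has a closed form without Bessel functions. The SPDE approach in two dimensions usually works with \(\nu = 1\), so the numbers are indicative only. The range is parameterised as \(r = \sqrt{8\nu}/\kappa\).

```python
import math


def matern32_correlation(distance_km, range_km):
    """Matérn correlation with smoothness 3/2 (assumed), in closed form.

    The range is parameterised as r = sqrt(8 * nu) / kappa, so kappa = sqrt(12) / r.
    """
    kappa = math.sqrt(12.0) / range_km
    x = kappa * distance_km
    return (1.0 + x) * math.exp(-x)


median_range_km = 84.7  # posterior median reported on the slide
for d in (0.0, 25.0, 50.0, 84.7, 150.0):
    print(d, round(matern32_correlation(d, median_range_km), 3))
```

With this parameterisation the correlation between two municipalities falls to roughly 0.14 at a distance equal to the range, so the latent field captures fairly local departures from the North-South trend.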

  • Darkest regions: educational vulnerability not captured by covariates
    • Po Delta; Gargano peninsula; Naples hinterland; upper Ionian Calabria; Northwestern Sardinia
  • Brightest regions: relatively advantaged territories
    • Inland Sicily; Central Italy inland; Salento peninsula

→ What does this example tell us?

  • Assessing the impact of class size on student outcomes is less simple than it seems → Auxiliary information such as the degree of centrality of municipalities or geographical information becomes crucial.